METHOD TO EXTEND OPERATING RANGE OF JOINT ADDITIVE AND
CONVOLUTIVE COMPENSATING ALGORITHMS 

FIELD OF INVENTION: 

[0001] This invention relates to speech recognition and more particularly to joint
additive acoustic noise and convolutive channel noise compensating algorithms for
speech recognition.
BACKGROUND OF INVENTION: 

[0002] A speech recognizer trained on clean speech data and operating in different
environments has lower performance due to at least the two distortion sources of
background noise and microphone or channel changes. Handling the two simultaneously
is critical to the performance of the recognizer.

[0003] Many front-end solutions have been developed and have been shown to give
promising results for connected digit recognition applications in very noisy
environments. See references 2 and 3. For instance, methods such as the ETSI
advanced DSR front-end handle both channel distortion and background noise. See
D. Macho, L. Mauuary, B. Noe, Y. M. Cheng, D. Ealey, D. Jouvet, H. Kelleher, D.
Pearce, and F. Saadoun, "Evaluation of a noise-robust DSR front-end on AURORA
databases," Proc. Int. Conf. on Spoken Language Processing, Colorado, USA, September
2002, pp. 17-20. These techniques do not require any noise training data. To be effective
in noise reduction, they typically require an accurate instantaneous estimate of the noise
spectrum.

[0004] Alternate solutions consist, instead, of modifying the back-end of the
recognizer to compensate for the mismatch between the training and recognition
environments. More specifically, in the acoustic model space, a convolutive (e.g.
channel) component and an additive (e.g. background noise) component can be
introduced to model the two distortion sources. See the following references: M. Afify,
Y. Gong, and J. P. Haton, "A general joint additive and convolutive bias compensation
approach applied to noisy Lombard speech recognition," IEEE Trans. on Speech and
Audio Processing, vol. 6, no. 6, pp. 524-538, November 1998; J. L. Gauvain, L. Lamel,
M. Adda-Decker, and D. Matrouf, "Developments in continuous speech dictation using
the ARPA NAB news task," in Proc. of IEEE Int. Conf. on Acoustics, Speech and Signal
Processing, Detroit, 1996, pp. 73-76; Y. Minami and S. Furui, "A maximum likelihood
procedure for a universal adaptation method based on HMM composition," in Proc. of
IEEE Int. Conf. on Acoustics, Speech and Signal Processing, Detroit, 1995, pp. 129-132;
M. J. F. Gales, Model-Based Techniques for Noise Robust Speech Recognition, Ph.D.
thesis, Cambridge University, U.K., 1995; Y. Gong, "A robust continuous speech
recognition system for mobile information devices (invited paper)," in Proc. of
International Workshop on Hands-Free Speech Communication, Kyoto, Japan, April
2001; and Y. Gong, "Model-space compensation of microphone and noise for
speaker-independent speech recognition," in Proc. of IEEE Int. Conf. on Acoustics,
Speech, and Signal Processing, Hong Kong, April 2003. The two distortions introduce,
in the log spectral domain, non-linear parameter changes, which can be approximated by
linear equations. See S. Sagayama, Y. Yamaguchi, and S. Takahashi, "Jacobian
adaptation of noisy speech models," in Proc. of IEEE Automatic Speech Recognition
Workshop, Santa Barbara, CA, USA, Dec. 1997, pp. 396-403, IEEE Signal Processing
Society; and N. S. Kim, "Statistical linear approximation for environmental compensation,"
IEEE Signal Processing Letters, vol. 5, no. 1, pp. 8-11, Jan. 1998.

[0005] It is desirable to provide a more robust utilization of a framework recently
developed by Texas Instruments Incorporated known as JAC (Joint compensation of
Additive and Convolutive distortions). This is described in patent application serial no.
10/251,734, filed Sept. 20, 2002, of Yifan Gong, entitled "Method of Speech Recognition
Resistant to Convolutive Distortion and Additive Distortion." This application is
incorporated herein by reference. JAC handles simultaneously both background noise
and channel distortions for speaker independent speech recognition. However, joint
additive acoustic noise and convolutive channel noise compensating algorithms for
improved speech recognition in noise are not able to operate at very low signal to noise
ratios (SNR). The reason lies in the fact that when the compensation mechanism of the
recognizer is suddenly exposed to a new type of channel noise or to a very low SNR
signal, inaccurate channel estimates or insufficient background noise compensation will
degrade the quality of the subsequent channel estimate, which in turn will degrade the
recognition accuracy and the channel estimate of the next sentence exposed to the
recognizer.

SUMMARY OF INVENTION 

[0006] In accordance with one embodiment of the present invention, a solution to
this problem includes adding inertia to the JAC channel estimate and forcing the
amplitude of the channel estimates to be a function of the amount of channel statistics
already observed.

[0007] In accordance with an embodiment of the present invention, in addition to
adding inertia to the JAC channel estimate and forcing the amplitude of the channel
estimates to be a function of the amount of channel statistics already observed (the SNR
environment), the channel estimate is forced to be within a certain range.

DESCRIPTION OF DRAWING 

[0008] Figure 1 illustrates a speech recognition system according to the prior art JAC
application.

[0009] Figure 2 is a flow chart of the operation of the processor in Fig. 1
according to the prior art.

[0010] Figure 3 illustrates past utterances used in recognizing current utterances.

[0011] Figure 4 is a block diagram of the environment and statistic dependent
channel estimate according to one embodiment of the present invention.

[0012] Figure 5 is a table of performance of the E-JAC back-end on the Aurora-2
database.

DESCRIPTION OF PREFERRED EMBODIMENTS OF THE PRESENT INVENTION 

REVIEW OF A JOINT ADDITIVE AND CONVOLUTIVE NOISE COMPENSATION 
(JAC) ALGORITHM 

[0013] A speech signal x(n) can only be observed in a given acoustic environment.
An acoustic environment can be modeled by a background noise b'(n) and a
distortion channel h(n). For typical mobile speech recognition, b'(n) consists, for
instance, of office noise, vehicle engine or road noise, and h(n) consists of the






TI-37331 

microphone type or its relative position to the speaker. Let y(n) be the speech observed in
the environment involving b'(n) and h(n): y(n) = (x(n) + b'(n)) * h(n). In typical speech
recognition applications, b'(n) cannot be measured directly. What is available is
b'(n) * h(n). Letting b(n) = b'(n) * h(n), our model of distorted speech becomes:

$$y(n) = x(n) \ast h(n) + b(n). \tag{1}$$

Or, in the power spectrum domain,

$$Y(k) = X(k)H(k) + B(k). \tag{2}$$

[0014] Representing the above quantities in logarithmic scale (superscript l), we have:

$$Y^l(k) = g(X^l, H^l, B^l)(k) \tag{3}$$

$$\phantom{Y^l(k)} = \log\left(\exp\left(X^l(k) + H^l(k)\right) + \exp\left(B^l(k)\right)\right). \tag{4}$$

[0015] Assuming the log-normal distribution and ignoring the variance, we have
in the acoustic model space,

$$E\{Y^l\} \triangleq \hat{m}^l = g(m^l, H^l, B^l), \tag{5}$$

where $m^l$ is the original Gaussian mean vector and $\hat{m}^l$ is the Gaussian mean
vector compensated for the distortions caused by channel $H^l$ and environment noise $B^l$.
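
As an illustration only, the mean compensation of equation 5 can be sketched in Python
as follows. The function name `compensate_mean` and the representation of the mean,
channel and noise as plain lists of per-bin log-spectral values are assumptions made for
this sketch and are not taken from the described system.

```python
import math


def compensate_mean(m_log, h_log, b_log):
    """Apply g(m, H, B) = log(exp(m + H) + exp(B)) to every frequency bin.

    All three arguments are equal-length sequences of log-spectral values
    (hypothetical layout chosen for this sketch).
    """
    compensated = []
    for m, h, b in zip(m_log, h_log, b_log):
        # Factor out the larger exponent so exp() cannot overflow.
        hi = max(m + h, b)
        lo = min(m + h, b)
        compensated.append(hi + math.log1p(math.exp(lo - hi)))
    return compensated
```

When the noise term is far below the channel-shifted speech term, the result approaches
m + H, so only the channel shift remains; when the noise dominates, the result
approaches B.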

ESTIMATION OF CHANNEL AND NOISE COMPONENTS 

[0016] Our goal is to derive the Hidden Markov Models (HMMs) of Y, the
speech signal under both additive noise and convolutive distortions. The key problem is
to obtain an estimate of the channel $H^l$ and noise $B^l$. We assume that some speech data
recorded in the noisy environment is available and that the starting HMM models for X
are trained on clean speech in the feature space.

[0017] Applying the Expectation-Maximization (EM) procedure, it can be shown
that $H^l$ and $B^l$ are given by the solution to the equation

$$u(H^l, B^l) = \sum_{r=1}^{R} \sum_{t=1}^{T_r} \sum_{j \in \Omega_s} \sum_{k \in \Omega_m} \gamma_t^r(j,k) \cdot \left\{ g(m^l_{j,k}, H^l, B^l) - \mathrm{DFT}(o_t^r) \right\} = 0, \tag{6}$$

where $\gamma_t^r(j,k)$ is the probability of being in state j with mixing component k at time t
given utterance r, $o_t^r$ is the observation feature vector at time t for utterance r, and DFT
is the Discrete Fourier Transform. For the EM procedure see A. P. Dempster, N. M. Laird,
and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm,"
Journal of the Royal Statistical Society, vol. 39, no. 1, pp. 1-38, 1977. For application of
the procedure see the above cited Y. Gong, "Model-space compensation of microphone and
noise for speaker-independent speech recognition" and Y. Gong, "A method of joint
compensation of additive and convolutive distortions for speaker-independent speech
recognition," IEEE Trans. on Speech and Audio Processing, 2002.

ESTIMATION OF NOISE COMPONENT 

[0018] Equation 6 can be used to solve both $H^l$ and $B^l$. However, for this case,
we assume $B^l$ to be stationary, and use the first P non-speech frames as an estimate of $B^l$.
We calculate an estimate of noise in the log domain $\hat{B}^l$ as the average of the P noise
frames in the log domain

$$\hat{B}^l = \frac{1}{P} \sum_{t=1}^{P} \mathrm{DFT}(y_t). \tag{7}$$



SOLVING CHANNEL EQUATION 

[0019] To solve $u(H^l, B^l = \hat{B}^l) = 0$ for $H^l$, we use Newton's method, which has
interesting convergence properties for on-line estimation of the parameters. The method is
iterative and gives a new estimate $H^l_{[i+1]}$, at iteration i + 1, of $H^l$ using

$$H^l_{[i+1]} = H^l_{[i]} - \frac{u(H^l_{[i]}, \hat{B}^l)}{u'(H^l_{[i]}, \hat{B}^l)}, \tag{8}$$

where $u'(H^l, \hat{B}^l)$ is the derivative of $u(H^l, B^l)$ with respect to channel $H^l$. As the initial
condition for Eq. 8, we can set $H^l_{[0]} = 0$.
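
The Newton iteration can be sketched for a single frequency bin as below. This is a
minimal illustration under stated assumptions, not the described implementation: the
helper names (`g`, `dg_dh`, `newton_channel`), the flattening of the sums over
utterances, frames, states and mixture components into one list of
(weight, mean, observation) triples, and the fixed iteration count are all hypothetical
choices made for the sketch.

```python
import math


def g(m, h, b):
    """log(exp(m + h) + exp(b)) for one bin, computed without overflow."""
    hi, lo = max(m + h, b), min(m + h, b)
    return hi + math.log1p(math.exp(lo - hi))


def dg_dh(m, h, b):
    """Derivative of g with respect to h: eta / (eta + 1), eta = exp(m + h - b)."""
    z = m + h - b
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def newton_channel(stats, b, h0=0.0, iterations=20):
    """Solve u(h) = sum(gamma * (g(m, h, b) - o)) = 0 for h by Newton's method.

    stats: list of (gamma, m, o) triples for one bin, where gamma is the
    posterior weight, m the clean model mean and o the observed value.
    """
    h = h0
    for _ in range(iterations):
        u = sum(gam * (g(m, h, b) - o) for gam, m, o in stats)
        du = sum(gam * dg_dh(m, h, b) for gam, m, o in stats)
        if du == 0.0:  # no usable statistics; keep the current estimate
            break
        h -= u / du
    return h
```

For example, if the observations are generated from the model with a known channel
value, `newton_channel(stats, b)` started from zero recovers that value.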
COMPENSATION FOR TIME DERIVATIVES 

[0020] The distortion caused by channel and noise also affects the distribution of
dynamic (e.g. time derivative of) MFCC coefficients. By definition, the
compensated time derivative of cepstral coefficients $\dot{Y}^c$ is the time derivative of the
compensated cepstral coefficients $Y^c$. See the reference above of M. J. F. Gales. It can be
shown [M. J. F. Gales and Y. Gong, last referenced] that both first and second order time
derivatives are respectively a function of

$$\eta(k) = \exp\left(H^l(k)\right)\psi(k), \tag{9}$$

where $\psi(k) = \dfrac{\exp(X^l(k))}{\exp(B^l(k))}$ is the SNR in the linear scale at the frequency bin k.

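
The dependence of the dynamic coefficients on the per-bin ratio can be made explicit with
a short derivation. It is a sketch under the assumption, made here for illustration, that
the channel and the noise do not change over the window used to compute the time
derivative. Writing X for the clean quantity as in equations 1 to 4, dropping the bin
index k, and letting $\eta = \exp(X^l + H^l - B^l)$:

$$Y^l = \log\left(\exp(X^l + H^l) + \exp(B^l)\right)$$

$$\dot{Y}^l = \frac{\exp(X^l + H^l)}{\exp(X^l + H^l) + \exp(B^l)}\,\dot{X}^l = \frac{\eta}{\eta + 1}\,\dot{X}^l$$

$$\ddot{Y}^l = \frac{\eta}{\eta + 1}\left(\frac{1}{\eta + 1}\,(\dot{X}^l)^2 + \ddot{X}^l\right)$$

The second line uses $\dot{\eta} = \eta\,\dot{X}^l$, and the third follows by differentiating
the second. At high SNR the factor $\eta/(\eta+1)$ tends to 1 and the dynamic features are
left almost unchanged; at low SNR it tends to 0 and they are strongly attenuated.
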
[0021] In accordance with the JAC invention, the estimate comprises the channel
estimate obtained from the previously recognized utterances, and the noise estimate is
obtained from the pre-utterance pause of the test utterance. Referring to Figure 1, the
recognizer system 10 includes a recognizer subsystem 11, Hidden Markov Models H 13,
a model adapter subsystem 17, a noise sensor and a microphone 15 to produce recognized
speech output from the recognizer 11. The model adapter subsystem 17 includes an
estimator 19 and a model adapter 21. The noise sensor 17 detects background noise
during a pre-utterance pause. The estimator 19 estimates the background noise (additive
parameters) and convolutive parameters. The convolutive and additive estimates are
applied to the model adapter 21. The adapter 21 modifies the HMM models 13, and the
adapted models are used for the speech recognition in the recognizer 11. The model
adapter 21 includes a processor that operates on the sensed noise, performs the estimates
and modifies the models from the HMM source 13.

[0022] The system recognizes utterances grouped speaker by speaker. The bias is
re-estimated after recognizing each test utterance, and is initialized to zero at the
beginning of the recognition of all speakers.

[0023] As the recognizer does not have to wait for channel estimation, which
would introduce at least a delay of the duration of the utterance, the recognition result is
available as soon as the utterance is completed. The processor in the model adapter
subsystem 17 operates according to the steps as follows:

1. Set channel $H^l$ to zero.

2. Start recognition of the current utterance.

3. Estimate background noise $N^l$ with pre-utterance pause.

4. For each HMM, modify Gaussian distributions with current $H^l$ and $N^l$ estimates

4.1 Static MFCC (Equation 5)

4.2 First order dynamic MFCC (Equation 13)

4.3 Second order dynamic MFCC (Equation 14)

5. Recognize the current utterance using the adapted HMM

6. Estimate channel parameter with alignment information (Equation 8)

6.1 For each segment r

6.1.1 Calculation of $\gamma_t^r(j,k)$

6.1.2 Accumulate statistics

6.1.3 Update channel parameters

6.2 If $H^l_{[i+1]} \neq H^l_{[i]}$, go to 6.1

7. Go to step 2.
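
The control flow of the steps above can be sketched as a simple loop. This is only a
skeleton under stated assumptions: the function `run_jac` and its four callable
parameters are hypothetical placeholders for the noise sensor, model adapter, recognizer
and channel estimator, and the channel is represented by whatever value those callables
exchange.

```python
def run_jac(utterances, estimate_noise, adapt_models, recognize, update_channel):
    """Recognize utterances one by one, carrying the channel estimate forward.

    estimate_noise(utt) -> noise estimate from the pre-utterance pause
    adapt_models(channel, noise) -> compensated models
    recognize(utt, models) -> (hypothesis, alignment)
    update_channel(channel, noise, utt, alignment) -> new channel estimate
    """
    channel = 0.0                                   # step 1
    results = []
    for utt in utterances:                          # steps 2 and 7
        noise = estimate_noise(utt)                 # step 3
        models = adapt_models(channel, noise)       # step 4
        hypothesis, alignment = recognize(utt, models)  # step 5
        results.append(hypothesis)
        # Step 6: the updated channel is only used from the next utterance on,
        # so the recognition result does not wait for channel estimation.
        channel = update_channel(channel, noise, utt, alignment)
    return results, channel
```

The point of the ordering is that each utterance is decoded with the channel estimated
from earlier utterances, while its own statistics only affect later ones.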


[0024] In step 1, the channel $H^l$ is set to zero. In step 2, recognition of the current
utterance is started. In step 3, the background noise is estimated using the noise sensed
during a pre-utterance pause. In step 4, for each HMM, the Gaussian distributions are
modified with the current channel $H^l$ and noise $N^l$ estimates. The static MFCC are
compensated using $E\{Y^l\} \triangleq \hat{m}^l = g(m^l, H^l, B^l)$ from equation 5, where $m^l$ is the
original Gaussian mean vector and $\hat{m}^l$ is the Gaussian mean vector compensated for the
distortions caused by channel and environment noise. The time derivatives are
compensated by compensating the first order dynamic MFCC using

$$\dot{\hat{Y}}^c = \mathrm{IDFT}\left(\frac{\eta(k)}{\eta(k)+1}\,\dot{Y}^l\right)$$

from equation 13 and compensating the second order dynamic MFCC using

$$\ddot{\hat{Y}}^c = \mathrm{IDFT}\left(\frac{\eta(k)}{\eta(k)+1}\left(\frac{1}{\eta(k)+1}\,(\dot{Y}^l)^2 + \ddot{Y}^l\right)\right)$$

from equation 14. The compensated first order derivative of MFCC $\dot{\hat{Y}}^c$ is calculated from
the original first order time derivative $\dot{Y}^l$. The next step, step 5, is to recognize the
current utterance using the modified or adapted HMM models. Step 6 is the estimation of
the channel component with alignment information, using an iterative method which
at iteration i+1 gives a new estimate $H^l_{[i+1]}$ of $H^l$ using:

$$H^l_{[i+1]} = H^l_{[i]} - \frac{\displaystyle\sum_{r=1}^{R}\sum_{t=1}^{T_r}\sum_{j \in \Omega_s}\sum_{k \in \Omega_m} \gamma_t^r(j,k)\left[g(m^l_{j,k}, H^l_{[i]}, B^l) - \mathrm{DFT}(o_t^r)\right]}{\displaystyle\sum_{r=1}^{R}\sum_{t=1}^{T_r}\sum_{j \in \Omega_s}\sum_{k \in \Omega_m} \gamma_t^r(j,k)\,h(m^l_{j,k}, H^l_{[i]}, B^l)}$$




[0025] The operations in step 6 include, for each segment of utterance r,
calculating $\gamma_t^r(j,k)$, accumulating statistics and updating the channel parameters.
$\gamma_t^r(j,k)$ is the joint probability of state j and mixing component k at time t of the
utterance r, given observed speech $o_t^r$ and current model parameters, and

$$h(X^l, H^l, B^l)(k) = \frac{\eta(k)}{\eta(k) + 1}$$

with

$$\eta(k) = \exp\left(X^l(k) + H^l(k) - B^l(k)\right).$$

[0026] If $H^l_{[i+1]} \neq H^l_{[i]}$ at the final iteration of segment r, then the steps of
calculation, accumulation and update are repeated.

[0027] In step 7, the channel estimation repeats for the next current utterance. As
illustrated in Figure 3, the estimates of the channel and noise parameters from past
utterances are used for the recognition of the current utterance. The quantities that allow
the determination of the optimum channel estimate, obtained during the recognition of
these past utterances, are carried over to the current estimate.

EXTENSIONS OF THE OPERATING RANGE OF JAC 
INTRODUCTION 

[0028] While jointly estimating and compensating for additive (acoustic) and
convolutive (channel) noise allows for a better recognition performance, special attention
must be paid to low SNR speech signals. When the SNR is too low or the noise is
highly non-stationary, it becomes difficult to make a correct noise estimate $\hat{B}^l$. In that
case, the channel estimate $H^l_{[i+1]}$ will not only reflect channel mismatch but will also
represent residual additive noise, erratic in nature. A solution to this problem consists of
adding inertia to the JAC channel estimate and forcing the amplitude of the channel
estimates to be within a certain range, as explained hereafter.

INERTIA ADDED TO JAC CHANNEL ESTIMATION 

[0029] At the beginning of a recognition task in a particular noise and channel
condition, the recognizer may be suddenly exposed to a new type of background noise
and microphone. It may be hard for the JAC algorithm to immediately give a good
estimate of the channel after one utterance, since not enough statistics have been
collected to represent the channel.

[0030] A solution consists of separating the running channel estimate $H^l_{[i+1]}$
from the channel estimate $\bar{H}^l_{[i+1]}$ used for model compensation (equation 5). $\bar{H}^l_{[i+1]}$
approaches $H^l_{[i+1]}$ gradually, as more channel statistics are collected with the
increasing number q of observed utterances.

[0031] A general equation to represent how $\bar{H}^l_{[i+1]}$ can slowly converge to
$H^l_{[i+1]}$ as more data is collected is

$$\bar{H}^l_{[i+1]} = f(q, \eta, H^l_{[i+1]}, \bar{H}^l_{[i]}), \tag{10}$$

where $\eta$ is an estimate of the sentence SNR. Our preferred embodiment uses
the following equation

$$\bar{H}^l_{[i+1]} = \bar{H}^l_{[i]} + \omega(q,\eta)\left(H^l_{[i+1]} - \bar{H}^l_{[i]}\right), \tag{11}$$

where the function $\omega(q,\eta)$ is a weighting function which increases
monotonically as the amount of collected channel statistics over q sentences increases. In
our preferred embodiment, we use the following weighting function

$$\omega(q,\eta) = \begin{cases} \dfrac{q}{Q(\eta)} & \text{if } q < Q(\eta) \\ 1 & \text{otherwise,} \end{cases} \tag{12}$$

where $Q(\eta)$ represents an SNR-dependent value.

[0032] After $Q(\eta)$ utterances have been observed in a particular environment, we have
sufficient statistics on the channel to allow the channel estimate used in acoustic
model compensation to be equal to the running channel estimate. As can be seen in
equation 12, the function $\omega(q,\eta)$ increases monotonically and linearly with q to equal 1
after $Q(\eta)$ sentences. If the speech is clear (good SNR), Q is small, and if the SNR is
poor, Q is larger. If Q is 1000 and there are q=100 utterances, the value is heavily
discounted at 0.1. If the value of q is small but the SNR is good, then Q is small and the
weighting function is large.

[0033] In summary, in our preferred embodiment, after an SNR-dependent number
$Q(\eta)$ of utterances have been recognized, we have $\bar{H}^l_{[i+1]} = H^l_{[i+1]}$. When
$q < Q(\eta)$, $\bar{H}^l_{[i+1]}$ is given by

$$\bar{H}^l_{[i+1]} = \bar{H}^l_{[i]} + \frac{q}{Q(\eta)}\left(H^l_{[i+1]} - \bar{H}^l_{[i]}\right). \tag{13}$$

[0034] However, this is only one particular case of equation 10, which best
represents the method conceptually.

[0035] Adding such inertia mainly helps when suddenly facing new noise or
channel conditions, where any channel estimate, based on only very few frames of
observation, can be misleading and degrade subsequent recognition tasks.
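
The inertia mechanism can be illustrated with a short Python sketch. The function names
`inertia_weight` and `smoothed_channel`, the argument `q_required` standing in for the
SNR-dependent count Q, and the per-bin list representation are assumptions made for this
illustration; how Q is chosen from the SNR is left to the caller.

```python
def inertia_weight(q, q_required):
    """Weight that rises linearly with the utterance count q and saturates at 1.

    q_required plays the role of the SNR-dependent count Q (hypothetical name).
    """
    if q_required <= 0:
        return 1.0
    return min(q / q_required, 1.0)


def smoothed_channel(prev_smoothed, running, q, q_required):
    """Move the compensation estimate part of the way toward the running estimate."""
    w = inertia_weight(q, q_required)
    return [p + w * (r - p) for p, r in zip(prev_smoothed, running)]
```

With few utterances relative to `q_required`, the estimate used for compensation moves
only slightly toward the running estimate; once enough utterances have been seen, the
two coincide.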
LIMITS ON JAC CHANNEL ESTIMATION 

[0036] Despite the additional robustness provided by the new iterative procedure
of equation 10, the channel estimate can still be inaccurate, especially with sudden
exposure to new noise and channel conditions at low SNR. In this case, it may be
beneficial to limit the amplitudes of the channel estimate that can be applied by JAC.
This is done by forcing the amplitudes of the channel estimates to be within a certain
range. The new channel estimate, $\tilde{H}^l_{[i+1]}$, is the one that will be used for JAC
compensation.

[0037] This means mathematically that $\tilde{H}^l_{[i+1]}$ is a function of the unlimited
channel estimate $\bar{H}^l_{[i+1]}$ and the sentence SNR $\eta$ as follows,

$$\tilde{H}^l_{[i+1]} = g(\bar{H}^l_{[i+1]}, \eta), \tag{14}$$

where the function g is monotonically increasing with $\bar{H}^l_{[i+1]}$.

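[0038] In our preferred embodiment, we chose to limit the range that
the JAC channel estimate can take by forcing it to be within lower and upper limits
$\tau_{inf}$ and $\tau_{sup}$, which are themselves functions of the sentence SNR $\eta$. Equation 14
then becomes

$$\tilde{H}^l_{[i+1]} = \begin{cases} \tau_{inf}(\eta) & \text{if } \bar{H}^l_{[i+1]} < \tau_{inf}(\eta) \\ \bar{H}^l_{[i+1]} & \text{if } \tau_{inf}(\eta) \le \bar{H}^l_{[i+1]} \le \tau_{sup}(\eta) \\ \tau_{sup}(\eta) & \text{if } \bar{H}^l_{[i+1]} > \tau_{sup}(\eta) \end{cases} \tag{15}$$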



[0039] In our particular implementation, we set the range to be symmetric around
0 dB ($-\tau_{inf} = \tau_{sup} = \tau$). Equation 15 limits $\bar{H}^l_{[i+1]}$ when it falls outside the
interval $[-\tau, \tau]$, and keeps $\bar{H}^l_{[i+1]}$ unaltered when $\bar{H}^l_{[i+1]} \in [-\tau, \tau]$.

[0040] Such a function guarantees that $\tilde{H}^l_{[i+1]} \in [-\tau(\eta), \tau(\eta)]$. Note also that if we
force $\tau(\eta) = 0$, we have $\tilde{H}^l_{[i+1]} = 0$, and only background noise compensation can be
applied.

[0041] Such a limitation on the channel estimate $H^l_{[i+1]}$ will prevent the running
channel estimate from diverging too severely from the true channel estimate in the many
cases where that would be possible, such as:

• When the acoustic noise becomes too large or time-varying to be sufficiently
estimated (equation 7) and compensated for,

• When recognition segmentation is mainly erroneous, which means that
incorrect HMM models are being used in the channel estimation process,
leading to a rapidly growing channel estimate $H^l_{[i+1]}$ to compensate,

• When the previous channel estimate is so inaccurate that the recognizer can no
longer operate and no valid surviving path is found in the Viterbi
recognition algorithm, in which case JAC estimation is skipped altogether.

[0042] Between the limits the limiting function is linear, and above and below the
limits it is constant. The limit values are determined by experimentation. A value may
be between zero and plus or minus 3.
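
The range limitation amounts to clamping each component of the channel estimate. A
minimal Python sketch follows; the function name `limit_channel` and the per-bin list
representation are assumptions for this illustration, and the limits are passed in by
the caller since their SNR dependence is determined experimentally.

```python
def limit_channel(smoothed, tau_inf, tau_sup):
    """Clamp every component of the channel estimate to [tau_inf, tau_sup]."""
    if tau_inf > tau_sup:
        raise ValueError("lower limit must not exceed upper limit")
    return [min(max(h, tau_inf), tau_sup) for h in smoothed]
```

Values inside the range pass through unchanged, values outside are replaced by the
nearest limit, and setting both limits to zero disables channel compensation.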

[0043] Figure 4 illustrates the block diagram for the computation of the original
running channel estimate $H^l_{[i+1]}$, the inertia-added channel estimate $\bar{H}^l_{[i+1]}$ and the
range limited channel estimate $\tilde{H}^l_{[i+1]}$. The elements are added to the parameter
estimator 19 in Figure 1. The block D is the delay between channel estimates (the next
utterance in the embodiment).
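
The chain of blocks described for Figure 4 can be sketched as a small stateful object.
This is an assumption-laden illustration: the class name `ChannelPostProcessor`, the
callables `q_of_snr` and `tau_of_snr` (standing in for the SNR-dependent count and
symmetric limit), and the choice to store the inertia-added estimate in the delay
element are all hypothetical and not taken from the figure.

```python
class ChannelPostProcessor:
    """Running estimate -> inertia -> range limit, with one-utterance memory."""

    def __init__(self, n_bins, q_of_snr, tau_of_snr):
        self.q_of_snr = q_of_snr        # SNR -> number of utterances required
        self.tau_of_snr = tau_of_snr    # SNR -> symmetric limit
        self.smoothed = [0.0] * n_bins  # delay element: previous inertia-added estimate
        self.q = 0                      # utterances observed so far

    def update(self, running, snr):
        """Return the range-limited estimate to use for the next utterance."""
        self.q += 1
        q_required = self.q_of_snr(snr)
        w = 1.0 if q_required <= 0 else min(self.q / q_required, 1.0)
        self.smoothed = [s + w * (r - s) for s, r in zip(self.smoothed, running)]
        tau = self.tau_of_snr(snr)
        return [min(max(s, -tau), tau) for s in self.smoothed]
```

A caller would create one instance per recognition session and call `update` once after
each utterance with the new running estimate and the sentence SNR.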

PERFORMANCE EVALUATION 
EXPERIMENTAL CONDITIONS 

[0044] The performance of the proposed Enhanced-JAC back-end algorithm (E-JAC)
was evaluated on the Aurora-2 database for the purpose of benchmarking it with the
new ETSI Advanced Front-End (AFE) standard. See D. Macho et al., "Evaluation of a
noise-robust DSR front-end on AURORA databases," in Proc. Int. Conf. on Spoken
Language Processing, Colorado, USA, September 2002, pp. 17-20.

[0045] The standard Aurora-2 testing procedure was used, which averages
recognition performance over 10 different noise conditions (two with channel mismatch
in Test C) at 5 different SNR levels (20dB, 15dB, 10dB, 5dB and 0dB). Since the clean
data and the data at -5dB are not used in the average performance evaluation, we have not
tested our algorithm at those noise levels.

[0046] As a reminder, performance of the AFE standard on the Aurora-2 database
is established using the following configuration: a 39-dimensional feature vector (13
AFE features with 1st and 2nd order derivatives) extracted every 10 ms and 16-state word
HMM models with 20 Gaussian mixtures per state. According to the official baseline for
Eurospeech 2003, the average performance of the AFE standard over the entire database
is 88.19%, which breaks down into the following percentages for each SNR condition,
from 20dB to 0dB: 98.92, 97.78, 94.61, 85.99 and 63.66.

[0047] In the evaluation of our solution, we move slightly away from the feature
vector being used, while keeping the HMM model topology the same. A 32-dimensional
feature vector is used, which corresponds to a 16-dimensional MFCC vector and its 1st
order derivative only. For memory/benefit ratio concerns, the Texas Instruments
Incorporated low footprint solution typically does not use second order derivatives.
While better results could be obtained by using the second order derivatives, it was
decided not to use the acceleration features. Note that this means that our system
operates on fewer features (about 20% fewer) than the AFE standard. In the experiments, a
speech model variance adjustment is also applied.



PERFORMANCE OF E-JAC BACK-END 

[0048] Figure 5 is a table that summarizes the performance of our back-end
solution on the Aurora-2 database using E-JAC with 2 passes. It can be seen that we
obtain an average performance level of 91.86%, which corresponds to a 31% relative
improvement (reduction in error rate) over AFE (88.19%). Note that these results are
obtained after performing two passes of the E-JAC decoding algorithm. With only one
pass, the result would be 91.47%. This indicates that most of the performance
improvement and noise robustness comes from the E-JAC algorithm, and not from the
two passes.
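
The 31% figure can be checked by treating the improvement as a relative reduction in
error rate, where the error rate is taken as 100% minus the reported accuracy:

$$\frac{(100 - 88.19) - (100 - 91.86)}{100 - 88.19} = \frac{11.81 - 8.14}{11.81} = \frac{3.67}{11.81} \approx 0.31$$
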

[0049] The important information to extract from the table of Figure 5, besides
the absolute results, is that the E-JAC algorithm works on every single noise condition,
even at 0 dB SNR, where noise and speech energies are equal.

[0050] This was not the case for the classic JAC algorithm, which would
regularly fail to operate at SNRs less than, or sometimes equal to, 10 dB. The reason for
such failure would be that residual noise after background noise compensation would
interfere with channel estimation, which in turn, after acoustic model compensation, would
degrade the segmentation accuracy for the next sentence and therefore the channel
compensation again. If no progressive channel estimation is performed at the
beginning of an exposure to a new noise and channel condition, and if no limit on channel
estimation can be imposed, the JAC channel estimator can rapidly diverge from the true
channel estimate.

[0051] Although preferred embodiments have been described, it will be apparent
to those skilled in the art that various modifications, additions, substitutions and the like
can be made without departing from the spirit of the invention and these are therefore
considered to be within the scope of the invention as defined in the following claims.